Random Forests belong to the class of ensemble methods. The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability/robustness over a single estimator.
There are two families of ensemble methods:
Averaging methods: build several estimators independently and then average their predictions. Examples: Bagging methods, Forests of randomized trees.
Boosting methods: base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. Examples: AdaBoost, Gradient Tree Boosting, ... (a short sketch contrasting the two families follows this list).
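To make the two families concrete, here is a minimal sketch that instantiates one estimator from each family; the toy dataset from make_classification is an assumption used only for illustration:

In [ ]:
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

# Hypothetical toy dataset, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Averaging family: independent estimators whose predictions are aggregated
averaging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Boosting family: estimators built sequentially to reduce the bias of the ensemble
boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)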
Bagging methods build several instances of a black-box estimator on random subsets of the original training set and then aggregate their individual predictions to form a final prediction.
In scikit-learn, bagging methods are offered as a unified BaggingClassifier meta-estimator, taking as input a user-specified base estimator along with parameters specifying the strategy to draw random subsets:
In [1]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Bag KNN classifiers, each trained on a random 50% of the samples and 50% of the features
bagging = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
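As a usage sketch (the toy dataset below is an assumption for illustration, not part of the original example), the meta-estimator is then fitted and used like any other classifier:

In [ ]:
from sklearn.datasets import make_classification

# Hypothetical toy data; each KNN sees a random half of the samples and features
X_bag, y_bag = make_classification(n_samples=100, n_features=4, random_state=0)
bagging = bagging.fit(X_bag, y_bag)
print(bagging.predict(X_bag[:5]))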
The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method. Both algorithms are perturb-and-combine techniques: a diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.
In [3]:
from sklearn.ensemble import RandomForestClassifier

# A tiny toy dataset: two samples, two features
X = [[0, 0], [1, 1]]
Y = [0, 1]

# Fit a forest of 10 trees
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, Y)
Each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.
The scikit-learn implementation combines classifiers by averaging their probabilistic predictions, instead of letting each classifier vote for a single class.
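This averaging is visible through predict_proba; a minimal sketch reusing the clf fitted above:

In [ ]:
# Class probabilities are the average of the per-tree probabilistic predictions,
# and predict returns the class with the highest averaged probability
print(clf.predict_proba([[0.8, 0.8]]))
print(clf.predict([[0.8, 0.8]]))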
In extremely randomized trees (the ExtraTreesClassifier and ExtraTreesRegressor classes), randomness goes one step further in the way splits are computed.
As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule.
This usually allows the variance of the model to be reduced a bit more, at the expense of a slightly greater increase in bias.
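A minimal sketch of the extra-trees variant, reusing the toy X and Y from above:

In [ ]:
from sklearn.ensemble import ExtraTreesClassifier

# Same API as RandomForestClassifier; split thresholds are drawn at random
etc = ExtraTreesClassifier(n_estimators=10)
etc = etc.fit(X, Y)
print(etc.predict([[0.8, 0.8]]))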
The main parameters to adjust when using these methods are n_estimators and max_features.
n_estimators is the number of trees in the forest. The larger, the better, but also the longer it will take to compute.
max_features is the size of the random subsets of features to consider when splitting a node. The lower, the greater the reduction of variance, but also the greater the increase in bias.
Good empirical default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks.
Good results are often achieved when max_depth=None is used in combination with min_samples_split=2 (i.e., when the trees are fully developed).
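As a sketch of these settings used together (the particular values are assumptions, not tuned recommendations), reusing the toy X and Y from above:

In [ ]:
from sklearn.ensemble import RandomForestClassifier

# Fully developed trees, sqrt(n_features) candidate features per split
clf_tuned = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    max_depth=None,
    min_samples_split=2,
)
clf_tuned = clf_tuned.fit(X, Y)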
The best parameter values should always be cross-validated.
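For instance, a minimal cross-validation sketch using GridSearchCV (the grid values and the toy data are assumptions for illustration):

In [ ]:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data and a small parameter grid
X_cv, y_cv = make_classification(n_samples=200, n_features=8, random_state=0)
param_grid = {"n_estimators": [10, 50, 100], "max_features": ["sqrt", None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search = search.fit(X_cv, y_cv)
print(search.best_params_)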
In random forests, bootstrap samples are used by default (bootstrap=True) while the default strategy for extra-trees is to use the whole dataset (bootstrap=False).
Finally, this module also features parallel construction of the trees and parallel computation of the predictions through the n_jobs parameter.
If n_jobs=k, then computations are partitioned into k jobs, and run on k cores of the machine. If n_jobs=-1, then all cores available on the machine are used.
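A minimal sketch, reusing the toy X and Y from above and assuming a multi-core machine:

In [ ]:
from sklearn.ensemble import RandomForestClassifier

# Build the trees and compute predictions on all available cores
clf_parallel = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf_parallel = clf_parallel.fit(X, Y)
print(clf_parallel.predict([[0.8, 0.8]]))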